A Scalable Blocking Framework for Multidatabase Privacy-preserving Record Linkage
Today many application domains, such as national statistics,
healthcare, business analytics, fraud detection, and national
security, require data to be integrated from multiple databases.
Record linkage (RL) is a process used in data integration which
links multiple databases to identify matching records that belong
to the same entity. RL enriches the usefulness of data by
removing duplicates, errors, and inconsistencies which improves
the effectiveness of decision making in data analytics
applications.
Often, organisations are not willing or authorised to share the
sensitive information in their databases with any other party due
to privacy and confidentiality regulations. The linkage of
databases of different organisations is an emerging research area
known as privacy-preserving record linkage (PPRL). PPRL
facilitates the linkage of databases by ensuring the privacy of
the entities in these databases.
In the multidatabase (MD) context, PPRL is significantly challenged
by the intrinsic exponential growth in the number of potential
record pair comparisons. Such linkage often requires significant
time and computational resources to produce the resulting
sets of matching records. Due to the increased risk of collusion,
preserving the privacy of the data becomes more problematic as
the number of parties involved in the linkage process increases.
Blocking is commonly used to scale the linkage of large
databases. The aim of blocking is to remove those record pairs
that correspond to non-matches (refer to different entities).
Many techniques have been proposed for blocking two databases in
RL and PPRL. However, many of these techniques are not suitable
for blocking multiple databases. This creates a need to develop
blocking techniques for the multidatabase linkage context, as
real-world applications increasingly require linking more than two
databases.
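The idea of blocking described above can be sketched with a simple standard-blocking example. This is a minimal illustration only; the blocking key used here (a surname prefix) and the record layout are assumptions for the example, not the thesis's actual technique:

```python
from collections import defaultdict

def block_records(records, blocking_key):
    """Group records by a blocking key; only records that share a
    block are later compared, pruning likely non-matching pairs."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[blocking_key(rec)].append(rec_id)
    return blocks

def candidate_pairs(blocks):
    """Generate candidate record pairs only within each block."""
    for ids in blocks.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                yield (ids[i], ids[j])

# Illustrative records, blocked on the first three letters of the surname.
records = {
    1: {"surname": "smith", "city": "perth"},
    2: {"surname": "smyth", "city": "perth"},
    3: {"surname": "jones", "city": "sydney"},
    4: {"surname": "smith", "city": "sydney"},
}
blocks = block_records(records, lambda r: r["surname"][:3])
pairs = sorted(candidate_pairs(blocks))
# Of the 6 possible pairs, only (1, 4) survives blocking here;
# records 2 and 3 fall into different blocks.
```

This shows the scalability benefit: comparison cost drops from quadratic in the database size to quadratic only within each (much smaller) block.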
This thesis is the first to conduct extensive research on
blocking for multidatabase privacy-preserving record linkage
(MD-PPRL). We consider several research problems in blocking for
MD-PPRL. First, we provide a broad review of the background
literature on PPRL. This allows us to identify the main research
gaps that need to be investigated in MD-PPRL. Second, we introduce a blocking
framework for MD-PPRL which provides more flexibility and control
to database owners in the block generation process. Third, we
propose different techniques that are used in our framework for
(1) blocking of multiple databases, (2) identifying blocks that
need to be compared across subgroups of these databases, and (3)
filtering redundant record pair comparisons by the efficient
scheduling of block comparisons to improve the scalability of
MD-PPRL. Each of these techniques covers an important aspect of
blocking in real-world MD-PPRL applications. Finally, this thesis
reports on an extensive evaluation of the combined application of
these methods on real datasets, which illustrates that they
outperform existing approaches in terms of scalability, accuracy,
and privacy.
Privacy-preserving Deep Learning based Record Linkage
Deep learning-based linkage of records across different databases is becoming
increasingly useful in data integration and mining applications to discover new
insights from multiple sources of data. However, due to privacy and
confidentiality concerns, organisations often are not willing or allowed to
share their sensitive data with any external parties, thus making it
challenging to build/train deep learning models for record linkage across
different organisations' databases. To overcome this limitation, we propose the
first deep learning-based multi-party privacy-preserving record linkage (PPRL)
protocol that can be used to link sensitive databases held by multiple
different organisations. In our approach, each database owner first trains a
local deep learning model, which is then uploaded to a secure environment and
securely aggregated to create a global model. The global model is then used by
a linkage unit to distinguish unlabelled record pairs as matches and
non-matches. We utilise differential privacy to achieve provable privacy
protection against re-identification attacks. We evaluate the linkage quality
and scalability of our approach using several large real-world databases,
showing that it can achieve high linkage quality while providing sufficient
privacy protection against existing attacks.
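The train-locally-then-aggregate step described above can be sketched as simple federated averaging with Gaussian noise added for differential privacy. This is a sketch under stated assumptions: the flat weight lists, the plain averaging scheme, and the noise scale are illustrative choices, not the protocol's actual design (which aggregates inside a secure environment):

```python
import random

def aggregate_models(local_weights, noise_scale=0.01, seed=42):
    """Build a 'global model' by averaging each party's locally
    trained weights, then add Gaussian noise for (illustrative)
    differential privacy before the model is shared onward."""
    rng = random.Random(seed)
    n_parties = len(local_weights)
    n_params = len(local_weights[0])
    global_weights = []
    for i in range(n_params):
        avg = sum(w[i] for w in local_weights) / n_parties
        global_weights.append(avg + rng.gauss(0.0, noise_scale))
    return global_weights

# Three database owners each contribute a locally trained model.
local = [[0.2, 0.5], [0.4, 0.3], [0.3, 0.4]]
global_model = aggregate_models(local)
# Each parameter is close to the plain average (0.3, 0.4), up to noise.
```

In the actual protocol the linkage unit only ever sees the aggregated, noised model, which is what limits re-identification risk.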
Evaluation measure for group-based record linkage
Traditionally, record linkage is concerned with linking pairs of records across data sets and the classification of such pairs into matches (assumed to refer to the same individual) and non-matches (assumed to refer to different individuals). Increasingly, however, more complex data sets are being linked where often the aim is to identify groups, or clusters, of records that refer to the same individual or to a group of related individuals. Examples include finding the records of all births to the same parents or all medical records generated by members of the same family. When ground truth data in the form of known true matches and non-matches are available, then linkage quality is traditionally evaluated based on the classified versus the true matches (links) using measures such as precision (also known as the positive predictive value) and recall (also known as sensitivity or the true positive rate). The quality of clusters generated in record linkage is of high importance, since the comparison of different linkage methods is largely based on the values obtained by such evaluation measures. However, minimal research has been conducted thus far to evaluate the suitability of existing evaluation measures in the context of linking groups of records. As we show, evaluation measures such as precision and recall are not suitable for evaluating groups of linked records because they evaluate the quality of individually linked record pairs rather than the quality of records grouped into clusters. We highlight the shortcomings of traditional evaluation measures and then propose a novel approach to evaluate cluster quality in the context of group-based record linkage. We empirically evaluate our proposed approach using real-world data and show that it better reflects the quality of clusters generated by a group-based record linkage technique
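The pairwise precision and recall measures that the abstract argues are unsuitable for grouped records can be computed as follows. This is a minimal sketch of the traditional pair-level evaluation only; the paper's proposed cluster-quality measure is not reproduced here:

```python
def pairwise_precision_recall(classified_links, true_links):
    """Traditional pairwise evaluation: precision = TP / (TP + FP),
    recall = TP / (TP + FN), computed over linked record pairs."""
    classified = {tuple(sorted(p)) for p in classified_links}
    truth = {tuple(sorted(p)) for p in true_links}
    tp = len(classified & truth)
    precision = tp / len(classified) if classified else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

# A linkage groups {a, b, c} into one cluster (three implied pairs),
# but only (a, b) is a true match.
p, r = pairwise_precision_recall(
    [("a", "b"), ("a", "c"), ("b", "c")],
    [("a", "b")],
)
# p = 1/3, r = 1.0: the pair-level scores say little about whether
# the cluster as a whole is correct, which is the shortcoming the
# paper's cluster-level evaluation addresses.
```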
Private Graph Data Release: A Survey
The application of graph analytics to various domains has yielded tremendous
societal and economic benefits in recent years. However, the increasingly
widespread adoption of graph analytics comes with a commensurate increase in
the need to protect private information in graph databases, especially in light
of the many privacy breaches in real-world graph data that was supposed to
preserve sensitive information. This paper provides a comprehensive survey of
private graph data release algorithms that seek to achieve the fine balance
between privacy and utility, with a specific focus on provably private
mechanisms. Many of these mechanisms fall under natural extensions of the
Differential Privacy framework to graph data, but we also investigate more
general privacy formulations like Pufferfish Privacy that can deal with the
limitations of Differential Privacy. A wide-ranging survey of the applications
of private graph data release mechanisms to social networks, finance, supply
chain, health and energy is also provided. This survey paper and the taxonomy
it provides should benefit practitioners and researchers alike in the
increasingly important area of private graph data release and analysis.
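As a concrete instance of extending differential privacy to graph data, an edge-level DP release of a single graph statistic via the Laplace mechanism might look like the sketch below. The statistic (edge count), epsilon value, and sampling details are illustrative assumptions; the mechanisms surveyed in the paper are considerably more sophisticated:

```python
import math
import random

def dp_edge_count(edges, epsilon=1.0, seed=7):
    """Release a graph's edge count under edge-level differential
    privacy: adding or removing one edge changes the count by at
    most 1 (sensitivity 1), so Laplace(1/epsilon) noise suffices."""
    rng = random.Random(seed)
    u = rng.random() - 0.5  # uniform in [-0.5, 0.5)
    scale = 1.0 / epsilon
    # Inverse-CDF sampling of Laplace(scale) noise.
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return len(set(edges)) + noise

graph_edges = [("a", "b"), ("b", "c"), ("a", "c")]
noisy_count = dp_edge_count(graph_edges)
# The released value is the true edge count (3) plus Laplace(1) noise,
# so no single edge's presence can be confidently inferred.
```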
Evaluating hardening techniques against cryptanalysis attacks on Bloom filter encodings for record linkage
Introduction
Due to privacy concerns, personal identifiers used for linking data often have to be encoded (masked) before being linked across organisations. Bloom filter (BF) encoding is a popular privacy technique that is now employed in real-world linkage applications. However, recent research has shown that BFs are vulnerable to cryptanalysis attacks.
Objectives and Approach
Attacks on BFs either exploit that encoding frequent plain-text values (such as common names) results in corresponding frequent BFs, or they apply pattern mining to identify co-occurring BF bit positions that correspond to frequent encoded q-grams (sub-strings). In this study we empirically evaluated the privacy of individuals encoded in BFs against two recent cryptanalysis attack methods by Christen et al. (2017/2018). We used two snapshots of the North Carolina Voter Registration database for our evaluation, where pairs of records corresponding to the same voter (with name or address variations) resulted in files with 222,251 BFs and 224,061 plain-text records, respectively.
Results
We encoded between two and four of the fields first and last name, street, and city into one BF per record. For combinations of three and four fields all plain-text values and BFs were unique, challenging any frequency-based attack. For hardening BFs, different suggested methods (balancing, random hashing, XOR, BLIP, and salting) were applied.
Without any hardening applied, up to 20.7% and 5% of plain-text values were correctly re-identified as 1-to-1 matches by the pattern-mining and frequency-based attack methods, respectively. No more than 5% correct 1-to-1 re-identification matches were achieved with the frequency-based attack on BFs encoding two fields when balancing, random hashing, or XOR folding was applied, while the pattern-mining based attack did not achieve any correct re-identifications for any hardening technique.
Conclusion/Implications
Given that BF encoding is now being employed in real-world linkage applications, it is important to study the limits of this privacy technique. Our experimental evaluation shows that although basic BFs without hardening techniques are susceptible to cryptanalysis attacks, some hardening techniques are able to protect BFs against these attacks.
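The q-gram Bloom filter encoding evaluated above, together with XOR-folding hardening, can be sketched as follows. The filter length, number of hash functions, and the double-hashing construction are illustrative parameters, not the study's exact configuration:

```python
import hashlib

def qgrams(value, q=2):
    """Split a string into its overlapping q-grams (sub-strings)."""
    return {value[i:i + q] for i in range(len(value) - q + 1)}

def bloom_encode(value, m=64, k=10):
    """Encode a value's q-grams into an m-bit Bloom filter using
    k hash functions derived by double hashing."""
    bf = [0] * m
    for gram in qgrams(value):
        h1 = int(hashlib.sha1(gram.encode()).hexdigest(), 16)
        h2 = int(hashlib.md5(gram.encode()).hexdigest(), 16)
        for i in range(k):
            bf[(h1 + i * h2) % m] = 1
    return bf

def xor_fold(bf):
    """Hardening by XOR folding: XOR the first half of the filter
    with the second half, halving its length and obscuring the
    correspondence between bit positions and q-grams."""
    half = len(bf) // 2
    return [a ^ b for a, b in zip(bf[:half], bf[half:])]

bf = bloom_encode("smith")
hardened = xor_fold(bf)
# Similar names set overlapping bit positions, which is what allows
# approximate matching on the encodings -- and also what frequency-
# and pattern-mining attacks exploit when no hardening is applied.
```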
Towards a 'smart' cost-benefit tool: using machine learning to predict the costs of criminal justice policy interventions
BACKGROUND: The Manning Cost–Benefit Tool (MCBT) was developed to assist criminal justice policymakers, policing organisations and crime prevention practitioners to assess the benefits of different interventions for reducing crime and to select those strategies that represent the greatest economic return on investment. DISCUSSION: A challenge with the MCBT and other cost–benefit tools is that users need to manually input a considerable amount of point-in-time data, a process that is time consuming, relies on subjective expert opinion, and introduces the potential for data-input error. In this paper, we present and discuss a conceptual model for a 'smart' MCBT that utilises machine learning techniques. SUMMARY: We argue that the Smart MCBT outlined in this paper will overcome the shortcomings of existing cost–benefit tools. It does this by reintegrating individual cost–benefit analysis (CBA) projects using a database system that securely stores and de-identifies project data, and redeploys it using a range of machine learning and data science techniques. In addition, the question of what works is respecified by the Smart MCBT tool as a data science pipeline, which serves to enhance CBA and reconfigure the policy making process in the paradigm of open data and data analytics.
This project was funded by the Economic & Social Research Council grant (ESRC Reference: ES/L007223/1) titled 'University Consortium for Evidence-Based Crime Reduction', the Australian National University's Cross College Grant, and the Jill Dando Institute of Security and Crime Science.
Use of Data Mining Methodologies in Evaluating Educational Data
Abstract: Online learning has gained wide recognition within higher education, and improving it has become a vital need in modern society as a way to raise people's level of knowledge. E-learning is highly valued because it connects students with learning resources without limits. Various Learning Management Systems (LMS) have been introduced to overcome the problem of managing large information sources, and they currently play a remarkable role in e-learning environments. However, a lack of knowledge about how to employ learning methodologies accurately has caused these e-learning systems to face many problems. Although LMSs enable teachers to manage diverse educational materials much more easily, differences in students' levels of access to learning resources and study materials mean it remains an unsolved problem to view each student's overall performance on a course module in terms of the student's behaviour, which indicates the student's actual learning capacity. It has therefore become a massive challenge to cover the actual needs of learners through e-learning systems. Because students have different learning patterns, there is a vital need to understand student performance in much more detail. A proper understanding of a student's overall performance, based on the amount of information he or she has gathered through online resources, will help teachers and tutors to identify students' different learning capacities and to provide the necessary guidance to improve their capabilities. To improve students' learning capabilities, teachers and tutors should be able to monitor the overall performance of each student separately and dynamically adjust
Privacy-Preserving Temporal Record Linkage
Record linkage (RL) is the process of identifying matching records from different databases that refer to the same entity. It is common that the attribute values of records that belong to the same entity evolve over time; for example, people can change their surname or address. Therefore, to identify the records that refer to the same entity over time, RL should make use of temporal information such as the time-stamp of when a record was created and/or last updated. However, if RL needs to be conducted on information about people, due to privacy and confidentiality concerns, organisations are often not willing or allowed to share sensitive data in their databases, such as personal medical records, or location and financial details, with other organisations. This paper is the first to propose a privacy-preserving temporal record linkage (PPTRL) protocol that can link records across different databases while ensuring the privacy of the sensitive data in these databases. We propose a novel protocol based on Bloom filter encoding which incorporates the temporal information available in records during the linkage process. Our approach uses homomorphic encryption to securely calculate the probabilities of entities changing attribute values in their records over a period of time. Based on these probabilities we generate a set of masking Bloom filters to adjust the similarities between record pairs. We provide a theoretical analysis of the complexity and privacy of our technique and conduct an empirical study on large real databases containing several millions of records. The experimental results show that our approach can achieve better linkage quality compared to non-temporal PPRL while providing privacy to individuals in the databases that are being linked.
This work was funded by the Australian Research Council under Discovery Projects DP130101801 and DP160101934.
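The idea of adjusting record-pair similarities by temporal change probabilities can be sketched as below. The exponential decay model and the way the weight is applied to the Dice similarity are illustrative assumptions for this sketch; the actual protocol derives the change probabilities under homomorphic encryption and applies them via masking Bloom filters:

```python
import math

def dice_similarity(bf1, bf2):
    """Dice coefficient between two Bloom filters given as bit lists."""
    common = sum(a & b for a, b in zip(bf1, bf2))
    total = sum(bf1) + sum(bf2)
    return 2.0 * common / total if total else 0.0

def temporal_weight(years_apart, change_rate=0.1):
    """Probability that an attribute value is unchanged after a time
    gap, modelled here as simple exponential decay."""
    return math.exp(-change_rate * years_apart)

def adjusted_similarity(bf1, bf2, years_apart, change_rate=0.1):
    """Down-weight the penalty for disagreement when the records are
    far apart in time, since the attribute may have legitimately
    changed; interpolate towards 1.0 as that chance grows."""
    sim = dice_similarity(bf1, bf2)
    w = temporal_weight(years_apart, change_rate)
    return w * sim + (1.0 - w) * 1.0

a = [1, 1, 0, 1, 0, 0, 1, 0]
b = [1, 0, 0, 1, 0, 1, 1, 0]
same_year = adjusted_similarity(a, b, years_apart=0)   # plain Dice: 0.75
ten_years = adjusted_similarity(a, b, years_apart=10)  # boosted above 0.75
```

The point of the adjustment is that two records created a decade apart are allowed to disagree more before being classified as a non-match.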